Abstract

Equilibrium states of large layered neural networks with differentiable activation 
function and a single, linear output unit are investigated using the replica formalism. 
The quenched free energy of a student network with a very large number of hidden 
units learning a rule of perfectly matching complexity is calculated analytically. The 
system undergoes a first order phase transition from unspecialized to specialized 
student configurations at a critical size of the training set. Computer simulations of 
learning by stochastic gradient descent from a fixed training set demonstrate that the 
equilibrium results describe quantitatively the plateau states which occur in practical 
training procedures at sufficiently small but finite learning rates. 

Methods from statistical physics have been applied with great success within the theory 
of learning in adaptive systems. One prominent example is the investigation of feedforward 
neural networks which are capable of learning an unknown rule from example data [1], [2].
Frequently, the training procedure, i.e. the choice of network parameters (weights), is 
based on an energy function which measures the agreement of the student network with 
the rule in terms of the given example outputs. Statistical mechanics techniques can be 
applied if training is interpreted as a stochastic process which leads to a properly defined 
thermal equilibrium [3-5]. A particularly interesting topic is that of phase transitions in
this context, see || for a recent review. In multilayered neural networks, for example, 
underlying symmetries can cause a discontinuous dependence of the success of learning on 
the size of the training set, see e.g. [14-19]. 

In this paper we present the first treatment of learning in fully connected soft-committee 
machines by means of the replica method. This type of network consists of a layer with K 
hidden units, all of which are connected with the entire input, and the total output of the 
net is proportional to the sum of their states. Previous studies have addressed large soft
committees ($K \to \infty$) with binary weights within the so-called Annealed Approximation
[14] or networks with finite $K$ in the limit of high training temperature [20].

Here, analytical results for the learning of a perfectly realizable rule at arbitrary training 
temperatures are derived (for very large K) within a replica symmetric ansatz. With 
an increasing size of the training set, the model exhibits a first order transition from 
unspecialized student configurations to specialized states with better performance. This 
transition is due to the invariance of the soft-committee output under permutation of 
hidden units. The same symmetry is known to result in quasi-stationary plateaus of the 
learning dynamics in on-line learning from a sequence of independent training examples [7-11],
see [12] for a recent overview of this framework. Here, on the contrary, we will consider
off-line learning from a fixed, limited set of examples. Furthermore we demonstrate that
the statistical physics results, if interpreted correctly, describe the behavior of practical 
learning prescriptions. To this end we compare our results with the outcome of a stochastic 
variant of the well-known backpropagation of error algorithm |[[ 0, [ZTf . 

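We investigate a student-teacher scenario where the rule is parametrized as

$$\tau(\xi) = \frac{1}{\sqrt{K}} \sum_{j=1}^{K} g\!\left(\frac{B_j \cdot \xi}{\sqrt{N}}\right). \qquad (1)$$
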
We assume an isotropic teacher with orthonormal weight vectors: $B_j \cdot B_k = N \delta_{jk}$ for all
$j, k$. The training of a perfectly matching student with outputs $\sigma(\xi) = \sum_{j=1}^{K} g(x_j)/\sqrt{K}$
is considered, where the arguments $x_j = J_j \cdot \xi/\sqrt{N}$ are defined through adaptive weights
$J_j$ with $J_j^2 = N$. The particular choice of the hidden unit activation function, $g(x) =
\mathrm{erf}(x/\sqrt{2})$, simplifies the mathematical treatment to a large extent 0, §, ^0|. We expect,
however, that our results apply qualitatively to a large class of sigmoidal functions including
the very similar and frequently used hyperbolic tangent.
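As an illustration of the model just defined, the following minimal Python sketch evaluates the soft-committee output for a given set of weight vectors. The function names and the helper `isotropic_teacher` (which simply uses scaled unit vectors as one possible orthonormal teacher) are our own choices for this example and are not taken from the simulations reported in this paper.

```python
import math
import random

def g(x):
    """Hidden unit activation g(x) = erf(x / sqrt(2))."""
    return math.erf(x / math.sqrt(2.0))

def soft_committee_output(weights, xi):
    """Return sum_j g(w_j . xi / sqrt(N)) / sqrt(K) for K weight vectors of length N."""
    n = len(xi)
    k = len(weights)
    fields = [sum(a * b for a, b in zip(w, xi)) / math.sqrt(n) for w in weights]
    return sum(g(x) for x in fields) / math.sqrt(k)

def isotropic_teacher(k, n):
    """Hypothetical helper: K orthogonal vectors with B_j . B_k = N * delta_jk."""
    return [[math.sqrt(n) if i == j else 0.0 for i in range(n)] for j in range(k)]

rng = random.Random(0)
N, K = 20, 3
B = isotropic_teacher(K, N)
xi = [rng.gauss(0.0, 1.0) for _ in range(N)]
tau = soft_committee_output(B, xi)

# A student that copies the teacher up to a permutation of hidden units
# produces the same output for every input.
J_permuted = [B[2], B[0], B[1]]
sigma = soft_committee_output(J_permuted, xi)
```

The last lines make the permutation symmetry of the hidden units explicit: relabelling the student vectors leaves the output unchanged.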

Learning is guided by the minimization of the training error 

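$$\epsilon_t = \frac{1}{P} \sum_{\mu=1}^{P} \epsilon(\{J_j\}, \xi^\mu) = \frac{1}{P} \sum_{\mu=1}^{P} \frac{1}{2} \left( \sigma(\xi^\mu) - \tau(\xi^\mu) \right)^2 \qquad (2)$$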

where $P$ is the number of training examples, which we assume to scale like $P = \alpha N K$
with $\alpha = O(1)$. The extensive quantity $H = P \epsilon_t$ plays the role of an energy. The
replica formalism for the calculation of the corresponding quenched free energy exploits
the identity $\langle \ln Z \rangle = \partial \langle Z^n \rangle / \partial n \,|_{n=0}$ where $\langle \ldots \rangle$ denotes an average over the set of random
training examples. $Z^n$ is equivalent to the partition function of $n$ non-interacting copies
(labelled $a = 1, 2, \ldots, n$) of the investigated system and reads:



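$$Z^n = \int d\mu(\{J_j^a\}) \, \exp\left[ -\beta \sum_{a=1}^{n} H(\{J_j^a\}) \right] \qquad (3)$$

with the inverse training temperature $\beta$.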
Here, the measure $d\mu$ is meant to incorporate the normalization $(J_j^a)^2 = N$ of the student
vectors. We perform the quenched average over all possible sets of independent training
inputs, the components of which are assumed to be i.i.d. Gaussian random numbers
with mean zero and unit variance. One obtains the following form:



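$$\langle Z^n \rangle = \int d\mu(\{J_j^a\}) \, e^{-P G_r} \quad \text{where} \quad G_r = -\ln \left\langle \exp\left[ -\frac{\beta}{2} \sum_{a=1}^{n} (\sigma^a - \tau)^2 \right] \right\rangle_\xi \qquad (4)$$
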
Here and in the following $\langle \ldots \rangle_\xi$ denotes an average over the randomness contained in
a single input vector. As the examples are independent, the quenched average over the
training set factorizes.
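The replica identity used above can be illustrated numerically on a toy ensemble. In the sketch below the "partition functions" are simply log-normal random numbers, which is an assumption made for illustration only and has nothing to do with the actual network model; `moment`, `quenched`, `replica` and `annealed` are our own names.

```python
import math
import random

rng = random.Random(1)
# Toy ensemble: Z = exp(y) with Gaussian y (illustrative assumption only).
log_z = [rng.gauss(0.5, 1.0) for _ in range(10000)]

def moment(n):
    """Sample estimate of <Z^n> for real-valued n."""
    return sum(math.exp(n * y) for y in log_z) / len(log_z)

# Quenched average <ln Z>, computed directly.
quenched = sum(log_z) / len(log_z)

# Replica route: finite-difference estimate of d<Z^n>/dn at n = 0.
n_small = 1e-4
replica = (moment(n_small) - 1.0) / n_small

# Annealed average ln <Z>, for comparison; it is never below the quenched one.
annealed = math.log(moment(1.0))
```

The two routes to the quenched average agree up to terms of order `n_small`, whereas the annealed value is systematically larger, which is why the quenched calculation is needed for the typical behavior.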

The sample average $G_r$ will only depend on the order parameters $R_{ij}^{a} = J_i^a \cdot B_j/N$ and
$Q_{ij}^{ab} = J_i^a \cdot J_j^b/N$. Similarly the generalization error $\epsilon_g = \frac{1}{2} \langle (\sigma - \tau)^2 \rangle_\xi$, which measures the
success of learning by averaging over arbitrary inputs is given by M 
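For the erf activation, Gaussian averages of products of hidden unit outputs are known in closed form: for zero-mean arguments $x, y$ of unit variance and covariance $c$ one has $\langle g(x) g(y) \rangle = (2/\pi) \arcsin(c/2)$. Assuming this standard identity together with the normalizations $Q_{ii} = 1$ and $B_j \cdot B_k = N \delta_{jk}$, the generalization error of a single student can be written as $\epsilon_g = \frac{1}{\pi K} \left[ \sum_{i,j} \arcsin(Q_{ij}/2) + K \arcsin(1/2) - 2 \sum_{i,j} \arcsin(R_{ij}/2) \right]$. The following sketch evaluates this expression; the function names are ours.

```python
import math

def generalization_error(Q, R):
    """eps_g from K x K overlap matrices Q (student-student) and R (student-teacher).

    Assumes Q[i][i] = 1 and an orthonormal teacher (teacher overlap matrix = identity).
    """
    k = len(Q)
    student = sum(math.asin(Q[i][j] / 2.0) for i in range(k) for j in range(k))
    teacher = k * math.asin(0.5)  # only the diagonal teacher overlaps contribute
    cross = sum(math.asin(R[i][j] / 2.0) for i in range(k) for j in range(k))
    return (student + teacher - 2.0 * cross) / (math.pi * k)

def identity(k):
    return [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]

K = 200
# Perfectly specialized student: each hidden unit coincides with one teacher unit.
eps_perfect = generalization_error(identity(K), identity(K))

# Unspecialized student: all student-teacher overlaps equal 1/K,
# student vectors mutually orthogonal.
R_flat = [[1.0 / K] * K for _ in range(K)]
eps_unspecialized = generalization_error(identity(K), R_flat)
```

For large $K$ the unspecialized value approaches $1/3 - 1/\pi \approx 0.015$, while the perfectly specialized student has vanishing generalization error.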
In this paper we restrict ourselves to networks with a very large number $K$ of hidden
units. Non-trivial results can be obtained in the limit $K \to \infty$ but with $K \ll N$ by
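assuming that the relevant student configurations will be site symmetric:

$$R_{ij}^{a} = R^a \delta_{ij} + S^a (1 - \delta_{ij}), \quad Q_{ij}^{aa} = \delta_{ij} + C^a (1 - \delta_{ij}), \quad Q_{ij}^{ab} = q^{ab} \delta_{ij} + p^{ab} (1 - \delta_{ij}) \ \ (a \neq b). \qquad (6)$$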



Here and elsewhere in the paper superscripts a, b label replicas, whereas i and j are hidden 
unit indices. The restriction (6) allows the system to assume unspecialized ($R^a = S^a$) or
specialized states ($R^a > S^a$). Note that the output of a student will be $O(\sqrt{K})$ and thus
on a different scale than the output of the teacher if $C^a$ is on the order of 1. So that the
magnitudes of the outputs match, we assume that the hidden unit cross overlaps ($C^a$, $p^{ab}$
and $S^a$) are on the order of $1/K$. As a consequence of this scaling one may show (|TI| that
the joint distribution of $\tau$ and the $\sigma^a$ becomes Gaussian in the large $K$ limit.

In the following we use the notation $\underline{\sigma} = (\sigma^1, \sigma^2, \ldots, \sigma^n, \tau)^T$, and define a matrix $B$
such that $\underline{\sigma}^T B \underline{\sigma} = \sum_{a=1}^{n} (\sigma^a - \tau)^2$. For large $K$ the Gaussian joint distribution of $\underline{\sigma}$ is
completely specified through the covariance matrix $M = \langle \underline{\sigma} \, \underline{\sigma}^T \rangle_\xi$, the elements of which
can be expressed in terms of order parameters. Hence one obtains the effective Hamiltonian
$G_r$, equation (4),

$$G_r = \frac{1}{2} \ln \det\left( \mathbf{1} + \beta M B \right) \qquad (7)$$

where the r.h.s. is a function of the site symmetric order parameters (6). A saddle point
integration gives $\frac{1}{N} \ln \langle Z^n \rangle$ as the extremum (w.r.t. $\{R_{ij}^{a}, Q_{ij}^{ab}\}$) of $-\frac{P}{N} G_r + s$,
i.e. $\langle Z^n \rangle \sim \exp[-P G_r + N s]$ at the saddle point, where

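$$s = \frac{1}{N} \ln \int d\mu(\{J_j^a\}) \prod_{k,l=1}^{K} \prod_{a,b=1}^{n} \delta\!\left( Q_{kl}^{ab} - \frac{1}{N} J_k^a \cdot J_l^b \right) \delta\!\left( R_{kl}^{a} - \frac{1}{N} J_k^a \cdot B_l \right) \qquad (8)$$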
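The Gaussian average behind the effective Hamiltonian $G_r$ can be checked numerically. For a zero-mean Gaussian vector with covariance $M$, the standard identity $\langle \exp(-\frac{\beta}{2} \underline{\sigma}^T B \underline{\sigma}) \rangle = \det(\mathbf{1} + \beta M B)^{-1/2}$ holds. The sketch below compares this with a Monte Carlo estimate for $n = 2$ replicas; the covariance entries are made-up numbers chosen only so that $M$ is positive definite, and all helper names are ours.

```python
import math
import random

def det(a):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in a]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-14:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def cholesky(m):
    """Lower triangular factor L with L L^T = m (m symmetric positive definite)."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

beta = 2.0
# Illustrative covariance of (sigma^1, sigma^2, tau); the values are invented.
M = [[0.5, 0.2, 0.3],
     [0.2, 0.5, 0.3],
     [0.3, 0.3, 0.4]]
# B encodes the quadratic form sum_a (sigma^a - tau)^2 for n = 2 replicas.
B = [[1.0, 0.0, -1.0],
     [0.0, 1.0, -1.0],
     [-1.0, -1.0, 2.0]]
MB = [[sum(M[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
one_plus = [[(1.0 if i == j else 0.0) + beta * MB[i][j] for j in range(3)] for i in range(3)]
g_r_exact = 0.5 * math.log(det(one_plus))

# Monte Carlo estimate of -ln < exp(-beta/2 * sum_a (sigma^a - tau)^2) >.
rng = random.Random(2)
L = cholesky(M)
total, samples = 0.0, 100000
for _ in range(samples):
    z = [rng.gauss(0.0, 1.0) for _ in range(3)]
    s1, s2, tau = (sum(L[i][k] * z[k] for k in range(3)) for i in range(3))
    total += math.exp(-0.5 * beta * ((s1 - tau) ** 2 + (s2 - tau) ** 2))
g_r_mc = -math.log(total / samples)
```

For this particular choice the differences $\sigma^a - \tau$ are uncorrelated with variance 0.3 each, so the exact value is $\ln 1.6$; the sampled estimate agrees within statistical error.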
The entropy term can be calculated by means of a saddle point integration itself after
writing the $\delta$-functions in their integral representation. One obtains



$$s = \frac{1}{2} \ln(\det C) + \mathrm{const.} \qquad (9)$$



where $C$ is the $[(n+1)K]$-dimensional square matrix of all cross- and self-overlaps of
(replicated) student and teacher vectors [16], [HI. In the Appendix we sketch a simpler
derivation of this result which avoids the saddle point method.

In order to proceed with the analysis, we make a replica symmetric ansatz, in addition
to site symmetry (6): $R^a = R$, $S^a = S$, $C^a = C$ and $q^{ab} = q$, $p^{ab} = p$ for $a \neq b$. This assumption
simplifies the evaluation of the determinants and allows for a straightforward treatment
of the limit $n \to 0$. In agreement with the scaling of the hidden unit cross overlaps, we
reparametrize: $S = \hat{S}/K$, $C = \hat{C}/(K-1)$, and $p = \hat{p}/K$. The parameters $\Delta = R - S$
and $\delta = q - p$ now measure the degree of specialization in the network. Inserting these in
the saddle point equations we find that the condition $\partial f/\partial \hat{S} = 0$ can only be satisfied if
$K(1 + \hat{C} - \delta - \hat{p}) = O(1)$. After eliminating $\hat{C}$ accordingly, we obtain the free energy
as a function of variables of order one:
with $u = 1/3 + \hat{C}/\pi$, $v = [2 \arcsin(\delta/2) + \hat{p}]/\pi$, and $w = [2 \arcsin(\Delta/2) + \hat{S}]/\pi$. Terms of
order $O(1/K)$ have been neglected on the r.h.s. of equation (10).

For $\alpha = O(1)$, the saddle point equations yield two different types of solution: an
unspecialized, committee symmetric branch with $\Delta = \delta = 0$ and specialized solutions with
$\Delta, \delta > 0$. In the first case we find $\hat{p} = \hat{S} = 1$ and $\hat{C} = 0$, with the generalization error
$\epsilon_g = 1/3 - 1/\pi$ independent of both $\alpha$ and $\beta$. In the specialized case we get $\hat{C} = 0$, $\hat{p} = 1 - \delta$
and $\hat{S} = 1 - \Delta$, while $\delta$ and $\Delta$ as functions of $\alpha$ and $\beta$ can be determined only numerically.
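The generalization error on both branches can be written down explicitly once the order parameters are known. The sketch below assumes that $u$, $1/3$ and $w$ are the large-$K$ limits of $\langle \sigma^2 \rangle$, $\langle \tau^2 \rangle$ and $\langle \sigma \tau \rangle$, so that $\epsilon_g = \frac{1}{2}(u + 1/3) - w$; this reading is ours, and it reproduces the value $1/3 - 1/\pi$ quoted for the unspecialized branch. The function names are hypothetical.

```python
import math

def eps_g(delta_cap, s_hat, c_hat):
    """eps_g = (u + 1/3)/2 - w, with u and w as defined in the text.

    delta_cap stands for Delta = R - S, s_hat and c_hat for the rescaled
    cross overlaps of order one.
    """
    u = 1.0 / 3.0 + c_hat / math.pi
    w = (2.0 * math.asin(delta_cap / 2.0) + s_hat) / math.pi
    return 0.5 * (u + 1.0 / 3.0) - w

def eps_g_unspecialized():
    """Committee symmetric branch: Delta = 0, S_hat = 1, C_hat = 0."""
    return eps_g(0.0, 1.0, 0.0)

def eps_g_specialized(delta_cap):
    """Specialized branch: C_hat = 0 and S_hat = 1 - Delta."""
    return eps_g(delta_cap, 1.0 - delta_cap, 0.0)
```

On the specialized branch the error decreases monotonically with $\Delta$ and vanishes at $\Delta = 1$; the dependence of $\Delta$ on $\alpha$ and $\beta$ still has to be obtained from the saddle point equations.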

Figure 1 (left) shows the generalization error as a function of $\alpha$ for three different values
of $\beta$. The system undergoes a first order phase transition from a committee symmetric
state ($R = S$) to a specialized solution with $R > S$. At constant training temperature,
a locally stable, specialized configuration appears at a ($\beta$-dependent) value $\alpha_{min}$. For
$\alpha > \alpha_{glob}(\beta)$, the specialized solution becomes globally stable. Asymptotically, the corresponding
generalization error $\epsilon_g$ and the training error $\epsilon_t$ decay like $1/(\alpha\beta)$ for large $\alpha$. In
contrast to the unspecialized phase, at a given $\alpha$ the generalization error always decreases
with increasing $\beta$ in the specialized phase.

It is important to note that an unspecialized configuration with constant $\epsilon_g$ remains
locally stable for all $\alpha$. For a given $\beta$ the corresponding training error is initially constant
with respect to the size of the training set. At an additional critical value of $\alpha$, the
order parameter $\delta = q - p$, which measures correlations between students in different replicas,
assumes a non-zero value, whereas in this phase $\Delta = R - S$ remains zero for all $\alpha$. This
transition does not affect the generalization error but it does cause a first order transition
to a slightly higher value of the training error $\epsilon_t$. The training error continues to increase
and approaches its asymptotic value $1/3 - 1/\pi$ while $\delta \to 1$ for $\alpha \to \infty$. The latter
indicates that, asymptotically, a unique set of unspecialized student weights is chosen in
all replicas. Due to the transition, the training and generalization error of the unspecialized
configuration coincide in the limit $\alpha \to \infty$.

Figure 1 (right) displays $\epsilon_t(\alpha)$ for $\beta = 100$, where the above mentioned phase transition
is located at $\alpha \approx 6.18$, where the training error jumps to a slightly larger value. The
transition within the unspecialized phase occurs at values of $\alpha$ which increase rapidly with
the training temperature, for instance at $\alpha \approx 139$ for $\beta = 10$ and $\alpha \approx 489$ for $\beta = 5$.



Our results parallel the findings of O, [16] and [19] for large multilayer networks with
threshold activation functions. We have found essentially the same qualitative behavior in
the limit of infinite training temperature [20] and by applying the Annealed Approximation.
However, the transition within the unspecialized phase cannot be identified in these simpler
frameworks. It is further quite possible that even the replica symmetric description of this
transition is incomplete. For threshold activation functions it was observed in [HJ that
this transition is affected by replica symmetry breaking, resulting in a lower critical value
for $\alpha$ than predicted in replica symmetry and changing the nature of the transition from
first to second order. A more detailed discussion of this transition for the present case will
be given elsewhere [22].



The limit $\beta \to \infty$ is of particular interest and corresponds to potentially error free
training with $\epsilon_t = 0$ for all $\alpha$. Within our replica symmetric ansatz we find for $\beta \to \infty$
that the system switches from poor to perfect generalization ($\epsilon_g = 0$) at $\alpha_{min} = \alpha_{glob} = 1$,
where the number of examples coincides with the number of adjustable weights in the
network. This is a consequence of the smooth, differentiable nature of the input-output
relation in this type of network. Such a transition to $\epsilon_g = 0$ is not observed in networks
with threshold activation functions and continuous weights. The achievement of perfect
generalization observed in networks with binary weights is due to a completely different
mechanism, i.e. a freezing transition in the discrete configuration space, see e.g. M, [5], [15].



It is of course a crucial question, whether our statistical mechanics treatment can give 
relevant results for practical applications. We have followed the standard approach and 
analysed a heat bath ensemble, i.e. a Gibbs distribution of network configurations. One 
might reproduce the Gibbs density in simulations of the learning process by use of an 
appropriate Langevin or Monte Carlo dynamics. However, these prescriptions are out of 
the question for practical applications in the case of continuous weights and differentiable 
outputs. Much faster and more effective methods exist, the most prominent one is certainly 
the so-called backpropagation of error |], ||, |2l| . 

When can we expect the statistical physics results to be relevant for such a practical
prescription? Under certain restricting assumptions one can show, for instance, that
stochastic gradient descent produces a stationary distribution which approximates a Gibbs 
density in the limit of infinitesimally small learning rates. This has been investigated in 
detail for simple systems in the vicinity of local energy minima [^, |24], ^] . But heat bath 
results can be interpreted in a broader context. Whenever an algorithm yields network 
configurations with a probability which depends exclusively on the training energy, one 
could in principle analyse an appropriate ensemble. All such ensembles, including the heat 
bath, refer to the same microcanonical density. Hence, for fixed energy, the system chooses
among the same set of possible states with equal probability and the same macroscopic
features emerge. Stability properties, however, will depend strongly on the considered
ensemble which has to be specified in order to locate a phase transition, for instance.

Figure 1: Generalization error and training error as functions of $\alpha = P/(KN)$.
Left panel: $\epsilon_g$ vs. $\alpha$ for three different training temperatures in the Gibbs ensemble.
For each temperature, the leftmost dashed line indicates the occurrence of a locally stable
specialized state; the second vertical line marks $\alpha_{glob}$ where it becomes globally stable.
Right panel: $\epsilon_t$ vs. $\alpha$ for $\beta = 100$. An additional (first order) transition occurs at which $\epsilon_t$
begins to increase in the unspecialized solution while $\epsilon_g$ remains constant. For $\alpha \to \infty$ the
training error approaches the value $\epsilon_t = \epsilon_g = 1/3 - 1/\pi$.

In Figure 2 (left panel) we have plotted the generalization error vs. the corresponding
training error at $\alpha = 5$ by eliminating $\beta$ in all saddle point solutions (regardless of their local
or global stability). Clearly, this dependence could be derived from the microcanonical
density as well. According to the above reasoning, the same graph is valid for all procedures
which produce configurations with a purely energy dependent probability.

In the following we demonstrate that the backpropagation algorithm appears to fulfill 
this requirement very well for a range of learning rates. To this end we have performed 
simulations of a stochastic version with updates 

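$$J_j^{t+1} = \sqrt{N} \, \frac{J_j^{t} - \eta \, \nabla_{J_j} \epsilon(\{J_k^t\}, \xi^{\mu(t)})}{\left| J_j^{t} - \eta \, \nabla_{J_j} \epsilon(\{J_k^t\}, \xi^{\mu(t)}) \right|} \qquad (11)$$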

The current training example $\{\xi^{\mu(t)}, \tau^{\mu(t)}\}$ is drawn randomly from the pool of $P = \alpha K N$
independent input-output pairs with probability $1/P$ at each time step. The learning rate
$\eta$ controls the step size of this stochastic gradient descent and the weights are normalized
explicitly. The number of hidden units was $K = 10$ in all simulations shown in Figure 2.
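A single update of this kind can be sketched in a few lines of Python. The code below is a minimal illustration under our own assumptions: the gradient step is taken with prefactor $\eta$ and followed by explicit renormalization to $J_j^2 = N$, the system is much smaller than the one simulated in this paper, and all function and variable names are ours. It is not the simulation code used for Figure 2.

```python
import math
import random

def g(x):
    return math.erf(x / math.sqrt(2.0))

def g_prime(x):
    """Derivative of erf(x / sqrt(2))."""
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * x * x)

def output(weights, xi):
    n = len(xi)
    fields = [sum(a * b for a, b in zip(w, xi)) / math.sqrt(n) for w in weights]
    return sum(g(x) for x in fields) / math.sqrt(len(weights))

def sgd_step(J, xi, tau, eta):
    """One normalized gradient step on eps = (sigma - tau)^2 / 2 for one example."""
    n, k = len(xi), len(J)
    fields = [sum(a * b for a, b in zip(w, xi)) / math.sqrt(n) for w in J]
    sigma = sum(g(x) for x in fields) / math.sqrt(k)
    new_J = []
    for w, x in zip(J, fields):
        # gradient of eps w.r.t. w: (sigma - tau) * g'(x) / sqrt(K) * xi / sqrt(N)
        scale = eta * (sigma - tau) * g_prime(x) / math.sqrt(k * n)
        moved = [a - scale * b for a, b in zip(w, xi)]
        norm = math.sqrt(sum(a * a for a in moved))
        new_J.append([math.sqrt(n) * a / norm for a in moved])  # enforce J_j^2 = N
    return new_J

def training_error(J, examples):
    return sum(0.5 * (output(J, xi) - tau) ** 2 for xi, tau in examples) / len(examples)

rng = random.Random(3)
N, K, P, eta = 10, 2, 40, 0.5  # toy sizes, alpha = P / (K N) = 2
B = [[math.sqrt(N) if i == j else 0.0 for i in range(N)] for j in range(K)]
examples = []
for _ in range(P):
    xi = [rng.gauss(0.0, 1.0) for _ in range(N)]
    examples.append((xi, output(B, xi)))

# Random initial student vectors, normalized to squared length N.
J = []
for _ in range(K):
    v = [rng.gauss(0.0, 1.0) for _ in range(N)]
    nv = math.sqrt(sum(a * a for a in v))
    J.append([math.sqrt(N) * a / nv for a in v])

err_start = training_error(J, examples)
for _ in range(2000):
    xi, tau = examples[rng.randrange(P)]  # uniform draw, probability 1/P
    J = sgd_step(J, xi, tau, eta)
err_end = training_error(J, examples)
```

Because the normalization is applied after every step, the student vectors stay on the sphere $J_j^2 = N$ throughout, and a student that already reproduces the teacher output on the presented example is left unchanged.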

In the course of learning one observes quasi-stationary states in which both $\epsilon_t$ and
$\epsilon_g$ remain almost constant over a large number of updates. These are reminiscent of
the plateaus found in on-line training of soft-committees [7-9] where each example is
presented exactly once. We have identified plateaus according to a heuristic criterion in
our simulations and determined the corresponding values of $\epsilon_g$ and $\epsilon_t$. Note that several
such states can be approached successively while learning with a fixed rate $\eta$. Details of
the simulations will be explained in a forthcoming publication [22].



Figure 2 (left) shows the observed pairs of values ($\epsilon_g$, $\epsilon_t$) for learning rates $0.1 < \eta < 4.0$.
Simulation results are in good agreement with the theoretical analysis for a range of finite
$\eta$. The algorithm favors configurations from either one of the two predicted phases; the
occurrence of states in between the specialized and the unspecialized branch is presumably
due to the finite size of the system. The data with $\epsilon_g$ significantly larger than predicted
correspond to plateaus found in simulations with relatively large $\eta$. Figure 2 (right panel)
displays the observed values of $\epsilon_g$ vs. $\eta$. For small enough learning rates the predicted
competition of specialized and unspecialized states is confirmed. For $\eta > 2$, the value of $\epsilon_g$
can deviate significantly from the prediction; its sudden increase at $\eta \approx 5$ is reminiscent of
the presence of a critical learning rate in on-line learning from a sequence of uncorrelated
examples 0, §.

Figure 2: Stochastic backpropagation in a system with $N = 150$ and $K = 10$ at
$\alpha = P/(KN) = 5$. Dots represent values found in plateau states of single runs, see
the description in the text.
Left panel: Solid lines show $\epsilon_g$ vs. $\epsilon_t$ as obtained from the Gibbs ensemble by eliminating $\beta$
and disregarding stability criteria. The dots display the data pairs observed in simulations
with learning rates between $\eta = 0.1$ and $4.0$.
Right panel: $\epsilon_g$ as found in plateau states as a function of the learning rate $\eta$. Note that
for $\eta > 2$, $\epsilon_g$ can deviate significantly from the prediction. These results contribute to the
set of points clearly above the horizontal line in the left panel. The generalization error
increases drastically for $\eta > 5$ (not shown in the left panel).

As argued above, the location of a sharp transition from poor to good generalization
cannot be expected to carry over from the heat bath to backpropagation results. We could
not establish a relation between the control parameters $\beta$ and $\eta$ since the specific density
of plateau states as produced by the training algorithm is unknown. Our simulations
support, however, the assumption that it is purely energy dependent for reasonable $\eta$.
The calculation of student-student overlaps provides further evidence for this hypothesis:
we find the predicted scaling $C \propto 1/K^2$ for small learning rates, whereas $C = O(1)$
independent of $K$ for large $\eta$. Apparently, stochastic gradient descent with large learning
rates prefers, among the states of a certain energy, those with highly correlated hidden
unit vectors.

In summary, we have presented an analytic description of learning in large soft-committee
machines by means of a replica symmetric treatment of the corresponding Gibbs
ensemble. A characteristic feature of this model is the existence of a first order phase
transition from poor to good generalization at a temperature dependent, critical size of
the training set. In the limit of error free training ($\beta \to \infty$) the transition is to perfect
generalization and occurs at $\alpha = 1$.

We expect our results to be relevant for a large class of practical algorithms which do not 
favor particular network configurations among those of equal training error. Simulations of 
learning by stochastic gradient descent with sufficiently small but finite learning rates show 
qualitative and quantitative agreement of plateau states with the theoretical predictions. 
This indicates that the considered training procedure provides network configurations with 
a purely energy dependent probability. The latter feature is lost if the learning rate is too 
large. 

We will provide a more detailed study of stochastic backpropagation in a forthcoming 
publication. Future research will furthermore address learning from noisy examples,
unrealizable rules, and the training of networks with a finite number of hidden units.
Appendix



We want to calculate a volume of the form 

$$V(Q) = \int dJ \, \delta(NQ - J^T J) = \int dJ \prod_{a,b=1 \, (a \le b)}^{n} \delta(N Q_{ab} - J_a \cdot J_b) \qquad (12)$$

where $Q$ is a symmetric, positive definite $(n, n)$-matrix of overlaps and $J$ is the $(N, n)$-matrix
which is composed of the $n$ vectors $J_a \in \mathbb{R}^N$.

For a suitable orthogonal $(n, n)$-matrix $o$ and a diagonal $(n, n)$-matrix $D$ one can write
$Q$ as $Q = o^T D D o$. We now apply the linear transformation $J \to J D o$ to the above integral.
Its determinant is $\det D^N$ and we obtain



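$$V(Q) = \int dJ \, \delta\!\left( o^T D (N \mathbf{1} - J^T J) D o \right) \det D^N. \qquad (13)$$

The Fourier representation of the $\delta$-function yields

$$\delta\!\left( o^T D (N \mathbf{1} - J^T J) D o \right) = C_n \int d\hat{Q} \, \exp\left( i \, \mathrm{Tr}\left[ \hat{Q} \, o^T D (N \mathbf{1} - J^T J) D o \right] \right). \qquad (14)$$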



The integration runs over symmetric $(n, n)$-matrices and $C_n = (2\pi)^{-n(n+1)/2} \, 2^{n(n-1)/2}$, where
the second factor arises from the fact that the off-diagonal elements are counted twice in
the trace. Using



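$$\mathrm{Tr}\left[ \hat{Q} \, o^T D (N \mathbf{1} - J^T J) D o \right] = \mathrm{Tr}\left[ D o \hat{Q} o^T D (N \mathbf{1} - J^T J) \right]$$

and transforming $\hat{Q}$ via $\hat{Q} \to o^T D^{-1} \hat{Q} D^{-1} o$ yields

$$\delta\!\left( o^T D (N \mathbf{1} - J^T J) D o \right) = C_n \det D^{-n-1} \int d\hat{Q} \, \exp\left( i \, \mathrm{Tr}\left[ \hat{Q} (N \mathbf{1} - J^T J) \right] \right) = \det D^{-n-1} \, \delta(N \mathbf{1} - J^T J) \qquad (15)$$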



and thus $V(Q) = \det D^{N-n-1} \, V(N \mathbf{1})$. Now $V(N \mathbf{1})$ is just a normalization constant and of
course $\det D^2 = \det Q$. Hence, in the limit $N \to \infty$ with $n$ of order one, one obtains

$$\frac{1}{N} \ln V(Q) = \frac{1}{2} \ln \det Q + O(1).$$
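As a quick consistency check of this result, the case $n = 1$ can be worked out directly in polar coordinates. Here $Q$ is a positive number, $S_N$ denotes the surface area of the unit sphere in $\mathbb{R}^N$ (our notation), and $V(1)$ is the volume at $Q = 1$, which plays the role of the normalization constant $V(N \mathbf{1})$ above:

```latex
V(Q) = \int d^N J \, \delta(NQ - J^2)
     = S_N \int_0^\infty dr \, r^{N-1} \, \delta(NQ - r^2)
     = \frac{S_N}{2} \, (NQ)^{(N-2)/2}
     = Q^{(N-2)/2} \, V(1),
\qquad
\frac{1}{N} \ln V(Q) = \frac{1}{2} \ln Q - \frac{1}{N} \ln Q + \frac{1}{N} \ln V(1).
```

With $D = \sqrt{Q}$ the prefactor $Q^{(N-2)/2}$ is exactly $\det D^{N-n-1}$ for $n = 1$, and the $Q$-dependent correction is of order $1/N$.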

The case where one considers an additional $(N, m)$-matrix $B$ of $m$ teacher vectors and
wants to evaluate $\int dJ \, \delta(NQ - J^T J) \, \delta(NR - J^T B)$ reduces to the above consideration by
noting that the integral will not depend on the choice of $B$, as long as the matrix of teacher
overlaps $T = B^T B/N$ is held fixed. Thus, one may in addition integrate over all $B$ which
have correlation matrix $T$.

For the system of $K$ teacher vectors and $nK$ replicated students we define the $(n+1)K$-dimensional
square matrix of overlaps

$$C = \begin{pmatrix} Q & R \\ R^T & T \end{pmatrix}$$

for which the above result yields equation (9).